Use simple linear regression to describe the relationship between a quantitative predictor and quantitative response variable.
Estimate the slope and intercept of the regression line using the least squares method.
Interpret the slope and intercept of the regression line.
The library() function is used to access functionality that is provided by R packages, but is not included in base R. install.packages() can be used to install new packages. Run this command from the console.
# install.packages("ggplot2")
First, load the package ggplot2 that will be used throughout the tutorial for data visualizations.
library(tidyverse)
This tutorial will be using the nhanes dataset where the variables are described in the file nhanes-codebook.txt. Load this data with the load function and specify the rda data file.
load(file='nhanes1518.rda')
The functions head() and names() can be used to explore the data.
head(nhanes1518)
## # A tibble: 6 × 52
## SEQN WTINT2YR WTMEC…¹ RIDST…² RIAGE…³ RIDAG…⁴ INDFM…⁵ RIDRE…⁶ DMDED…⁷ DMDED…⁸
## <int> <dbl> <dbl> <int> <int> <int> <dbl> <int> <int> <int>
## 1 83732 134671. 135630. 2 1 62 4.39 3 5 NA
## 2 83733 24329. 25282. 2 1 53 1.32 3 3 NA
## 3 83734 12400. 12576. 2 1 78 1.51 3 3 NA
## 4 83735 102718. 102079. 2 2 56 5 3 5 NA
## 5 83736 17628. 18235. 2 2 42 1.23 4 4 NA
## 6 83737 11252. 10879. 2 2 72 2.82 1 2 NA
## # … with 42 more variables: INDHHIN2 <int>, BMXBMI <dbl>, BMXWAIST <dbl>,
## # BMXWT <dbl>, BMXHT <dbl>, URXUCR <dbl>, WTSB2YR <dbl>, URXCNP <dbl>,
## # URDCNPLC <int>, URXCOP <dbl>, URDCOPLC <int>, URXECP <dbl>, URDECPLC <int>,
## # URXHIBP <dbl>, URDHIBLC <int>, URXMBP <dbl>, URDMBPLC <int>, URXMC1 <dbl>,
## # URDMC1LC <int>, URXMCOH <dbl>, URDMCOLC <int>, URXMEP <dbl>,
## # URDMEPLC <int>, URXMHBP <dbl>, URDMHBLC <int>, URXMHH <dbl>,
## # URDMHHLC <int>, URXMHNC <dbl>, URDMCHLC <int>, URXMHP <dbl>, …
names(nhanes1518)
## [1] "SEQN" "WTINT2YR" "WTMEC2YR" "RIDSTATR" "RIAGENDR" "RIDAGEYR"
## [7] "INDFMPIR" "RIDRETH1" "DMDEDUC2" "DMDEDUC3" "INDHHIN2" "BMXBMI"
## [13] "BMXWAIST" "BMXWT" "BMXHT" "URXUCR" "WTSB2YR" "URXCNP"
## [19] "URDCNPLC" "URXCOP" "URDCOPLC" "URXECP" "URDECPLC" "URXHIBP"
## [25] "URDHIBLC" "URXMBP" "URDMBPLC" "URXMC1" "URDMC1LC" "URXMCOH"
## [31] "URDMCOLC" "URXMEP" "URDMEPLC" "URXMHBP" "URDMHBLC" "URXMHH"
## [37] "URDMHHLC" "URXMHNC" "URDMCHLC" "URXMHP" "URDMHPLC" "URXMIB"
## [43] "URDMIBLC" "URXMNP" "URDMNPLC" "URXMOH" "URDMOHLC" "URXMZP"
## [49] "URDMZPLC" "WTINT4YR" "WTMEC4YR" "WTSB4YR"
We will first explore the dataset by doing exploratory data analysts on the predictor variable age:
# observations with age = 0
nhanes1518%>%filter(RIDAGEYR==0)
## # A tibble: 753 × 52
## SEQN WTINT…¹ WTMEC…² RIDST…³ RIAGE…⁴ RIDAG…⁵ INDFM…⁶ RIDRE…⁷ DMDED…⁸ DMDED…⁹
## <int> <dbl> <dbl> <int> <int> <int> <dbl> <int> <int> <int>
## 1 83765 13932. 14073. 2 2 0 3.17 3 NA NA
## 2 83782 8735. 0 1 2 0 0.98 3 NA NA
## 3 83798 18168. 18352. 2 2 0 5 5 NA NA
## 4 83840 5844. 6011. 2 2 0 0.75 1 NA NA
## 5 83852 17442. 18896. 2 2 0 5 3 NA NA
## 6 83858 5770. 5651. 2 1 0 NA 1 NA NA
## 7 83939 17961. 18143. 2 2 0 2.99 3 NA NA
## 8 83942 12427. 12481. 2 1 0 5 3 NA NA
## 9 83956 15425. 16860. 2 2 0 4.12 3 NA NA
## 10 83957 5478. 6025. 2 2 0 0.49 1 NA NA
## # … with 743 more rows, 42 more variables: INDHHIN2 <int>, BMXBMI <dbl>,
## # BMXWAIST <dbl>, BMXWT <dbl>, BMXHT <dbl>, URXUCR <dbl>, WTSB2YR <dbl>,
## # URXCNP <dbl>, URDCNPLC <int>, URXCOP <dbl>, URDCOPLC <int>, URXECP <dbl>,
## # URDECPLC <int>, URXHIBP <dbl>, URDHIBLC <int>, URXMBP <dbl>,
## # URDMBPLC <int>, URXMC1 <dbl>, URDMC1LC <int>, URXMCOH <dbl>,
## # URDMCOLC <int>, URXMEP <dbl>, URDMEPLC <int>, URXMHBP <dbl>,
## # URDMHBLC <int>, URXMHH <dbl>, URDMHHLC <int>, URXMHNC <dbl>, …
With the number of 0 age observations, we want to limit our attention to adults to investigate the effect of age on BMI
nhanes1518 <- nhanes1518%>%filter(RIDAGEYR>=18)
# Basic histogram
ggplot(nhanes1518, aes(x=RIDAGEYR)) + geom_histogram()+labs(x = "waist circumference", title = "Distribution of Age")
# Change the width of bins
ggplot(nhanes1518, aes(x=RIDAGEYR)) +
geom_histogram(binwidth=1)+labs(x = "Age", title = "Distribution of Age")
# Change colors
p<-ggplot(nhanes1518, aes(x=RIDAGEYR)) +
geom_histogram(color="black", fill="white")+labs(x = "Age", title = "Distribution of Age")
We see that the distribution of waist circumference is asymmetric, with peaks at age of around 80 and 0 respectively.
We can also add the mean line, as well as overlay with transparent density plot. The value of alpha controls the level of transparency:
# Add mean line
p+geom_vline(aes(xintercept=mean(RIDAGEYR)),
color="blue", linetype="dashed", size=1)
# Histogram with density plot
ggplot(nhanes1518, aes(x=RIDAGEYR)) +
geom_histogram(aes(y=..density..), colour="black", fill="white")+
geom_density(alpha=.2, fill="#FF6666")
Simple linear regressions take the form
\[Y_i = \beta_0 +\beta_1 X_i +\epsilon_i\] Where \(Y_i\) is the dependent variable, \(X_i\) is the independent variable and \(\epsilon_i\) is the random error term.
We’ll start with a fitting a simple linear model using the lm() function. Instead of attaching the nhanes1518 dataset, we also can specify the data from the lm() function. In the lm() function, the first variable is the response variable and the variables to the right of the ~ symbol are the predictor variable(s). Here we use age as the response, and BMI as the predictor variables.
lm.fit <- lm(RIDAGEYR ~ BMXBMI, data = nhanes1518)
There are several ways that we can examine the model results. First, we can just call the name of the lm() model for a brief summary.
lm.fit
##
## Call:
## lm(formula = RIDAGEYR ~ BMXBMI, data = nhanes1518)
##
## Coefficients:
## (Intercept) BMXBMI
## 45.4146 0.1153
We can also use the names() function to list all of the names of variables in the lm.fit model:
names(lm.fit)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "na.action" "xlevels" "call" "terms"
## [13] "model"
The summary() function gives a more extensive overview of the model fit:
summary(lm.fit)
##
## Call:
## lm(formula = RIDAGEYR ~ BMXBMI, data = nhanes1518)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.786 -16.164 0.086 14.837 32.948
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.41461 0.73402 61.871 < 2e-16 ***
## BMXBMI 0.11531 0.02413 4.778 1.79e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.48 on 11094 degrees of freedom
## (752 observations deleted due to missingness)
## Multiple R-squared: 0.002053, Adjusted R-squared: 0.001963
## F-statistic: 22.83 on 1 and 11094 DF, p-value: 1.795e-06
The coefficients of the linear regression model can be extracted using the coef() function and the confidence interval(s) with the confint() function.
coef(lm.fit)
## (Intercept) BMXBMI
## 45.4146068 0.1153059
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) 43.97580422 46.8534094
## BMXBMI 0.06799999 0.1626119
We can use the predict() function to obtain prediction intervals or confidence intervals for a given value of the predictor variable, BMXWAIST. Note that when using the predict function, the column names and format of the new points at which to predict needs to be the same as the original data frame used to fit the lm() model. If you encounter errors using the predict() function, this is a good first thing to check.
predict(lm.fit, data.frame(BMXBMI = (c(70, 80, 90))), interval = "confidence")
## fit lwr upr
## 1 53.48602 51.54110 55.43094
## 2 54.63908 52.22710 57.05106
## 3 55.79214 52.91114 58.67314
predict(lm.fit, data.frame(BMXBMI = (c(70, 80, 90))), interval = "prediction")
## fit lwr upr
## 1 53.48602 17.21825 89.75379
## 2 54.63908 18.34327 90.93489
## 3 55.79214 19.46215 92.12213
We can plot the variables BMXBMI and BMXWAIST using the plot() function and overlay the regression line found using lm() with the abline() function.
plot(x=nhanes1518$RIDAGEYR,y=nhanes1518$BMXBMI, xlab="Age",ylab="Body Mass Index (kg/m**2)")
abline(lm.fit)
Here the dollar sign specifies that we retrieve the predictor columns BMXBMI and BMXWAIST from the dataset nhanes1518
We can experiment with different options for abline() by changing the line width and color in abline().
plot(x=nhanes1518$RIDAGEYR,y=nhanes1518$BMXBMI, xlab="Body Mass Index (kg/m**2)",ylab="Waist Circumference (cm)")
abline(lm.fit, lwd = 3, col = "red")
plot(x=nhanes1518$RIDAGEYR,y=nhanes1518$BMXBMI, xlab="Body Mass Index (kg/m**2)",ylab="Waist Circumference (cm)", col = "red")
The pch argument in plot() changes the shape/type of the points that are plotted.
plot(x=nhanes1518$RIDAGEYR,y=nhanes1518$BMXBMI, xlab="Body Mass Index (kg/m**2)",ylab="Waist Circumference (cm)",pch = 20)
plot(x=nhanes1518$RIDAGEYR,y=nhanes1518$BMXBMI, xlab="Body Mass Index (kg/m**2)",ylab="Waist Circumference (cm)",pch = "+")
Optional: We can make a similar plot using ggplot, where we fit the linear regression model using ggplot().
library(ggplot2)
ggplot(nhanes1518, aes(x = RIDAGEYR, y = BMXBMI)) +
geom_smooth(method = "lm", formula = y ~ x, colour = "blue") +
geom_point() +
ggtitle("Age vs. BMI for the nhanes data")
We can use the residuals() and rstudent() functions to extract the residuals and studentized residuals, respectively, from the linear model and plot them along with the predicted values.
plot(predict(lm.fit), residuals(lm.fit))
plot(predict(lm.fit), rstudent(lm.fit))
Additionally, we can compute the influence matrix for the predictors using the hatvalues() function.
plot(hatvalues(lm.fit))
which.max(hatvalues(lm.fit))
## 8723
## 8196
The par() function can be used to create a grid of multiple subplots.
par(mfrow = c(2, 2))
plot(lm.fit)
The diagnostic plots show residuals in four different ways:
Residuals vs Fitted. Used to check the linear relationship assumptions. A horizontal line, without distinct patterns is an indication for a linear relationship, what is good. The model we fitted shows roughly a linear relationship, with no distinct patterns (such as a fan or funnel shape) in the residuals vs. fitted plot.
Normal Q-Q. Used to examine whether the residuals are normally distributed. It’s good if residuals points follow the straight dashed line. The Q-Q plot generally follows the straight dashed line, with some deviations at the end towards high values of theoretical quantiles.
Scale-Location (or Spread-Location). Used to check the homogeneity of variance of the residuals (homoscedasticity). Horizontal line with equally spread points is a good indication of homoscedasticity.
Residuals vs Leverage. Used to identify influential cases, that is extreme values that might influence the regression results when included or excluded from the analysis. Based on the residuals vs. leverage plot, there are no influential points according to Cook’s distance. However, there might be some points with high standard residuals values which could be marked as outliers.
Some other metrics from the model:
The lm() function can also fit multiple regression models. In this section, we will use RIDAGEYR, BMXWAIST, BMXWT, and BMXHT as predictors of the response variable BMXBMI.
lm.fit <- lm(BMXBMI ~ RIDAGEYR+ BMXWAIST + BMXWT + BMXHT, data = nhanes1518)
summary(lm.fit)
##
## Call:
## lm(formula = BMXBMI ~ RIDAGEYR + BMXWAIST + BMXWT + BMXHT, data = nhanes1518)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1176 -0.3262 0.0503 0.3471 8.7914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.4364793 0.2080287 276.1 < 2e-16 ***
## RIDAGEYR -0.0039875 0.0005317 -7.5 6.88e-14 ***
## BMXWAIST 0.0166524 0.0015277 10.9 < 2e-16 ***
## BMXWT 0.3443669 0.0012653 272.2 < 2e-16 ***
## BMXHT -0.3463345 0.0011206 -309.1 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8401 on 10529 degrees of freedom
## (1314 observations deleted due to missingness)
## Multiple R-squared: 0.986, Adjusted R-squared: 0.986
## F-statistic: 1.849e+05 on 4 and 10529 DF, p-value: < 2.2e-16
BMXWT: The coeffcient for the predictor BMXWT is 0.2420969, which means that for every unit increase in the participant’s waist circumference, the BMI is expected to increase by 0.2420969 on average, holding all else constant (holding all other predictor variables, BMXWAIST, BMXWT, and BMXHT constant)In the lm() formula, a dot . can be used to include all variables in the NHANES data as predictors.
nhanes_na_removed<-cbind(nhanes1518[1:5],nhanes1518$BMXBMI)
# data clearning to ignore NA values
lm.fit1 <- lm(nhanes1518$BMXBMI ~ ., data = nhanes_na_removed)
summary(lm.fit1)
##
## Call:
## lm(formula = nhanes1518$BMXBMI ~ ., data = nhanes_na_removed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.957e-13 -1.310e-16 2.400e-17 1.780e-16 5.827e-15
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.587e-15 4.726e-16 -1.605e+01 < 2e-16 ***
## SEQN -6.901e-21 4.878e-21 -1.415e+00 0.1572
## WTINT2YR -1.533e-20 7.213e-21 -2.125e+00 0.0336 *
## WTMEC2YR 1.481e-20 6.928e-21 2.137e+00 0.0326 *
## RIDSTATR NA NA NA NA
## RIAGENDR -2.707e-16 5.427e-17 -4.988e+00 6.2e-07 ***
## `nhanes1518$BMXBMI` 1.000e+00 3.726e-18 2.684e+17 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.842e-15 on 11090 degrees of freedom
## (752 observations deleted due to missingness)
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.451e+34 on 5 and 11090 DF, p-value: < 2.2e-16
If we want to exclude specific variables from the list of predictors, we can use the - notation. In the following example, all predictor variables but age are included in the model.
library(dplyr)
nhanes_na_removed<-cbind(nhanes1518[1:5],nhanes1518$BMXBMI)
# data clearning to ignore NA values
lm.fit1 <- lm(nhanes1518$BMXBMI ~ . - RIDSTATR, data = nhanes_na_removed)
summary(lm.fit1)
##
## Call:
## lm(formula = nhanes1518$BMXBMI ~ . - RIDSTATR, data = nhanes_na_removed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.957e-13 -1.310e-16 2.400e-17 1.780e-16 5.827e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.587e-15 4.726e-16 -1.605e+01 < 2e-16 ***
## SEQN -6.901e-21 4.878e-21 -1.415e+00 0.1572
## WTINT2YR -1.533e-20 7.213e-21 -2.125e+00 0.0336 *
## WTMEC2YR 1.481e-20 6.928e-21 2.137e+00 0.0326 *
## RIAGENDR -2.707e-16 5.427e-17 -4.988e+00 6.2e-07 ***
## `nhanes1518$BMXBMI` 1.000e+00 3.726e-18 2.684e+17 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.842e-15 on 11090 degrees of freedom
## (752 observations deleted due to missingness)
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.451e+34 on 5 and 11090 DF, p-value: < 2.2e-16
Including -1 excludes the intercept from the model.
lm.fit1 <- lm(nhanes1518$BMXBMI ~ .- 1, data = nhanes_na_removed)
summary(lm.fit1)
##
## Call:
## lm(formula = nhanes1518$BMXBMI ~ . - 1, data = nhanes_na_removed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.434e-12 -3.000e-16 5.000e-16 1.300e-15 9.070e-14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## SEQN -2.674e-19 8.860e-20 -3.018e+00 0.00255 **
## WTINT2YR -5.118e-19 1.310e-19 -3.907e+00 9.41e-05 ***
## WTMEC2YR 4.943e-19 1.259e-19 3.928e+00 8.61e-05 ***
## RIDSTATR -8.811e-14 4.292e-15 -2.053e+01 < 2e-16 ***
## RIAGENDR -7.173e-15 9.858e-16 -7.277e+00 3.65e-13 ***
## `nhanes1518$BMXBMI` 1.000e+00 6.768e-17 1.478e+16 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.163e-14 on 11090 degrees of freedom
## (752 observations deleted due to missingness)
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 6.418e+32 on 6 and 11090 DF, p-value: < 2.2e-16
The update() function can be used to specify a new formula for an existing model.
lm.fit1 <- update(lm.fit, ~. - RIDSTATR)
There are two ways to include interaction terms in the model, : and *. The : symbol only includes the interaction term between the two variables, while the * symbol includes the variables themselves, as well as the interaction terms. This means that BMXWT*BMXWAIST is equivalent to BMXWT + BMXWAIST + BMXWT:BMXWAIST.
summary(lm(BMXBMI ~ BMXWT* BMXWAIST, data = nhanes1518))
##
## Call:
## lm(formula = BMXBMI ~ BMXWT * BMXWAIST, data = nhanes1518)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6955 -1.7808 -0.2051 1.5276 20.6194
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.379e+00 4.769e-01 2.892 0.00384 **
## BMXWT 4.045e-02 6.577e-03 6.150 8.01e-10 ***
## BMXWAIST 1.913e-01 5.180e-03 36.925 < 2e-16 ***
## BMXWT:BMXWAIST 6.650e-04 5.117e-05 12.997 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.71 on 10530 degrees of freedom
## (1314 observations deleted due to missingness)
## Multiple R-squared: 0.8539, Adjusted R-squared: 0.8538
## F-statistic: 2.051e+04 on 3 and 10530 DF, p-value: < 2.2e-16
A simple way to include all interaction terms is the syntax .^2.
library(dplyr)
nhanes_na_removed<-cbind(nhanes1518[1:5],nhanes1518$BMXBMI)
# data clearning to ignore NA values
summary(lm(nhanes1518$BMXBMI ~ .^2, data = nhanes_na_removed))
##
## Call:
## lm(formula = nhanes1518$BMXBMI ~ .^2, data = nhanes_na_removed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.956e-13 -1.400e-16 1.900e-17 1.930e-16 5.786e-15
##
## Coefficients: (6 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.194e-15 2.375e-15 -2.187e+00 0.0288 *
## SEQN -1.453e-20 2.518e-20 -5.770e-01 0.5640
## WTINT2YR -1.138e-19 1.393e-19 -8.170e-01 0.4138
## WTMEC2YR 1.221e-19 1.344e-19 9.090e-01 0.3634
## RIDSTATR NA NA NA NA
## RIAGENDR -1.760e-15 9.395e-16 -1.873e+00 0.0611 .
## `nhanes1518$BMXBMI` 1.000e+00 6.442e-17 1.552e+16 <2e-16 ***
## SEQN:WTINT2YR 7.582e-25 1.433e-24 5.290e-01 0.5967
## SEQN:WTMEC2YR -8.528e-25 1.384e-24 -6.160e-01 0.5377
## SEQN:RIDSTATR NA NA NA NA
## SEQN:RIAGENDR 1.502e-20 9.817e-21 1.530e+00 0.1261
## SEQN:`nhanes1518$BMXBMI` -3.974e-22 6.755e-22 -5.880e-01 0.5564
## WTINT2YR:WTMEC2YR -4.070e-27 7.030e-27 -5.790e-01 0.5626
## WTINT2YR:RIDSTATR NA NA NA NA
## WTINT2YR:RIAGENDR -7.333e-21 1.484e-20 -4.940e-01 0.6212
## WTINT2YR:`nhanes1518$BMXBMI` 7.194e-22 1.182e-21 6.090e-01 0.5429
## WTMEC2YR:RIDSTATR NA NA NA NA
## WTMEC2YR:RIAGENDR 7.441e-21 1.429e-20 5.210e-01 0.6026
## WTMEC2YR:`nhanes1518$BMXBMI` -7.133e-22 1.145e-21 -6.230e-01 0.5332
## RIDSTATR:RIAGENDR NA NA NA NA
## RIDSTATR:`nhanes1518$BMXBMI` NA NA NA NA
## RIAGENDR:`nhanes1518$BMXBMI` 2.472e-18 7.725e-18 3.200e-01 0.7490
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.843e-15 on 11080 degrees of freedom
## (752 observations deleted due to missingness)
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.834e+33 on 15 and 11080 DF, p-value: < 2.2e-16
nhanes_na_removed<-cbind(nhanes1518[1:5],nhanes1518$BMXBMI, nhanes1518$INDHHIN2)
nhanes_income <- dplyr::rename(nhanes_na_removed,income = "nhanes1518$INDHHIN2")%>% na.omit()
nhanes_income <- dplyr::rename(nhanes_income,BMI = "nhanes1518$BMXBMI")
# turn quantitative variable into categorical variable
nhanes_income$income<-as.character(nhanes_income$income)
Now we explore the effect of income as a categorical predictor variable on the response BMI. Refer to the website for encoding of income categories:
https://wwwn.cdc.gov/nchs/nhanes/2011-2012/demo_g.htm#INDHHIN2
We want to first drop categories with values 77 (Refused) and 99 (Don’t Know) first:
nhanes_income <- subset(nhanes_income, income!="77" & income!="99")
Then we fit the linear regression on categorical variable income:
summary(lm(BMI ~ income, data = nhanes_income))
##
## Call:
## lm(formula = BMI ~ income, data = nhanes_income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.053 -5.097 -1.153 3.796 55.708
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.08676 0.42827 67.917 < 2e-16 ***
## income10 0.25666 0.53138 0.483 0.62910
## income12 0.07855 0.56365 0.139 0.88917
## income13 1.15974 0.72021 1.610 0.10737
## income14 0.35724 0.48585 0.735 0.46218
## income15 -0.60605 0.46019 -1.317 0.18788
## income2 1.16654 0.57274 2.037 0.04170 *
## income3 0.65874 0.52391 1.257 0.20866
## income4 0.39053 0.51242 0.762 0.44599
## income5 0.61757 0.51184 1.207 0.22763
## income6 0.91220 0.47882 1.905 0.05680 .
## income7 0.86657 0.48244 1.796 0.07249 .
## income8 0.96502 0.49816 1.937 0.05275 .
## income9 1.40552 0.51288 2.740 0.00615 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.255 on 10176 degrees of freedom
## Multiple R-squared: 0.006989, Adjusted R-squared: 0.00572
## F-statistic: 5.509 on 13 and 10176 DF, p-value: 4.42e-10
Baseline: income category 1 corresponding to a household income of 0 to 4,999 dollars.
Model Interpretation:
Intercept: The intercept 25.6489 means that for people in the baseline income category (income category 1 corresponding to a household income of 0 to 4,999 dollars), the BMI is expected to be 25.6489 on average.
income6: The coeffcient for the predictor income6 is 1.1473, which means that for participants with household income category 6 (25,000 to 34,999 dollars per ear), the BMI is expected to be 1.1473 higher than that of participants with household income in category 1 (0 to 4,999 dollars), on average.
Multiple linear regression allows to evaluate the relationship between two variables, while controlling for the effect (i.e., removing the effect) of other variables.